NSF PAR Search | NSF Public Access Repository


Threshold Pivoting for Dense LU Factorization

https://doi.org/10.1109/ScalAH56622.2022.00010

Lindquist, Neil; Gates, Mark; Luszczek, Piotr; Dongarra, Jack (November 2022, IEEE)

PAQR: Pivoting Avoiding QR factorization

https://doi.org/10.1109/IPDPS54959.2023.00040

Sid-Lakhdar, Wissam; Cayrols, Sebastien; Bielich, Daniel; Abdelfattah, Ahmad; Luszczek, Piotr; Gates, Mark; Tomov, Stanimire; Johansen, Hans; Williams-Young, David; Davis, Timothy; et al (May 2023, 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS))
Proposed Consistent Exception Handling for the BLAS and LAPACK

https://doi.org/10.1109/Correctness56720.2022.00006

Demmel, James; Dongarra, Jack; Gates, Mark; Henry, Greg; Langou, Julien; Li, Xiaoye; Luszczek, Piotr; Pereira, Weslley; Riedy, Jason; Rubio-Gonzalez, Cindy (November 2022, Sixth International Workshop on Software Correctness for HPC Applications (Correctness 2022), a workshop of the ACM/IEEE SC 2022 Conference (SC'22), Dallas, TX, USA, November 13-18, 2022)

Numerical exceptions, which may be caused by overflow, operations like division by 0 or sqrt(−1), or convergence failures, are unavoidable in many cases, in particular when software is used on unforeseen and difficult inputs. As more aspects of society become automated (e.g., self-driving cars, health monitors, and cyber-physical systems more generally), it is becoming increasingly important to design software that is resilient to exceptions and that responds to them in a consistent way. Consistency is needed to allow users to build higher-level software that is also resilient and consistent (and so on recursively). In this paper we explore the design space of consistent exception handling for the widely used BLAS and LAPACK linear algebra libraries, pointing out a variety of instances of inconsistent exception handling in the current versions, and propose a new design that balances consistency, complexity, ease of use, and performance. Some compromises are needed because there are preexisting inconsistencies that are outside our control, including in or between existing vendor BLAS implementations, different programming languages, and even compilers for the same programming language. In addition, user requests from our surveys are quite diverse. We also propose our design as a possible model for other numerical software, and welcome comments on our design choices.
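As a rough illustration of what "responding to exceptions in a consistent way" could look like at the caller's level, the sketch below wraps NumPy's `np.linalg.solve` so that every failure mode comes back through one status value. The helper name `solve_checked` and its status strings are hypothetical and are not taken from the paper or from the BLAS/LAPACK proposal.

```python
import numpy as np

def solve_checked(A, b):
    """Hypothetical wrapper: solve A x = b and report problems uniformly.

    Returns (x, status), where status is one of "ok", "nonfinite_input",
    "singular", or "nonfinite_output". These names are illustrative only.
    """
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)

    # Reject NaN/Inf on input instead of letting them propagate silently.
    if not (np.isfinite(A).all() and np.isfinite(b).all()):
        return None, "nonfinite_input"

    try:
        x = np.linalg.solve(A, b)
    except np.linalg.LinAlgError:
        # NumPy raises LinAlgError when the factorization hits an exact zero pivot.
        return None, "singular"

    # Overflow during the solve can still yield Inf/NaN in the result.
    if not np.isfinite(x).all():
        return None, "nonfinite_output"
    return x, "ok"

x_ok, status_ok = solve_checked([[2.0, 0.0], [0.0, 4.0]], [2.0, 4.0])
_, status_nan = solve_checked([[np.nan, 0.0], [0.0, 1.0]], [1.0, 1.0])
_, status_sing = solve_checked([[0.0, 0.0], [0.0, 0.0]], [1.0, 1.0])
```

Higher-level code can then branch on a single status value regardless of which failure occurred, which is the kind of composability the abstract argues for.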
Evolution of the SLATE linear algebra library

https://doi.org/10.1177/10943420241286531

Gates, Mark; Abdelfattah, Ahmad; Akbudak, Kadir; Al Farhan, Mohammed; Alomairy, Rabab; Bielich, Daniel; Burgess, Treece; Cayrols, Sébastien; Lindquist, Neil; Sukkari, Dalal; et al (September 2024, The International Journal of High Performance Computing Applications)

SLATE (Software for Linear Algebra Targeting Exascale) is a distributed, dense linear algebra library targeting both CPU-only and GPU-accelerated systems, developed over the course of the Exascale Computing Project (ECP). While it began with several documents setting out its initial design, significant design changes occurred throughout its development. In some cases, these were anticipated: an early version used a simple consistency flag that was later replaced with a full-featured consistency protocol. In other cases, performance limitations and software and hardware changes prompted a redesign. Sequential communication tasks were parallelized; host-to-host MPI calls were replaced with GPU device-to-device MPI calls; more advanced algorithms such as Communication Avoiding LU and the Random Butterfly Transform (RBT) were introduced. Early choices that turned out to be cumbersome, error prone, or inflexible have been replaced with simpler, more intuitive, or more flexible designs. Applications have been a driving force, prompting a lighter weight queue class, nonuniform tile sizes, and more flexible MPI process grids. Of paramount importance has been building a portable library that works across several different GPU architectures – AMD, Intel, and NVIDIA – while keeping a clean and maintainable codebase. Here we explore the evolving design choices and their effects, both in terms of performance and software sustainability.
Task-graph scheduling extensions for efficient synchronization and communication

https://doi.org/10.1145/3447818.3461616

Bak, Seonmyeong; Hernandez, Oscar; Gates, Mark; Luszczek, Piotr; Sarkar, Vivek (June 2021, 35th ACM International Conference on Supercomputing (ICS))

Task graphs have been studied for decades as a foundation for scheduling irregular parallel applications and incorporated in many programming models including OpenMP. While many high-performance parallel libraries are based on task graphs, they also have additional scheduling requirements, such as synchronization within inner levels of data parallelism and internal blocking communications. In this paper, we extend task-graph scheduling to support efficient synchronization and communication within tasks. Compared to past work, our scheduler avoids deadlock and oversubscription of worker threads, and refines victim selection to increase the overlap of sibling tasks. To the best of our knowledge, our approach is the first to combine gang-scheduling and work-stealing in a single runtime. Our approach has been evaluated on the SLATE high-performance linear algebra library. Relative to the LLVM OMP runtime, our runtime demonstrates performance improvements of up to 13.82%, 15.2%, and 36.94% for LU, QR, and Cholesky, respectively, evaluated across different configurations related to matrix size, number of nodes, and use of CPUs vs GPUs.
A Set of Batched Basic Linear Algebra Subprograms and LAPACK Routines

https://doi.org/10.1145/3431921

Abdelfattah, Ahmad; Costa, Timothy; Dongarra, Jack; Gates, Mark; Haidar, Azzam; Hammarling, Sven; Higham, Nicholas J.; Kurzak, Jakub; Luszczek, Piotr; Tomov, Stanimire; et al (June 2021, ACM Transactions on Mathematical Software)
This article describes a standard API for a set of Batched Basic Linear Algebra Subprograms (Batched BLAS or BBLAS). The focus is on many independent BLAS operations on small matrices that are grouped together and processed by a single routine, called a Batched BLAS routine. The matrices are grouped together in uniformly sized groups, with just one group if all the matrices are of equal size. The aim is to provide more efficient, but portable, implementations of algorithms on high-performance many-core platforms. These include multicore and many-core CPU processors, GPUs and coprocessors, and other hardware accelerators with floating-point compute facility. As well as the standard types of single and double precision, we also include half and quadruple precision in the standard. In particular, half precision is used in many very large scale applications, such as those associated with machine learning.
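The grouping idea can be sketched with NumPy, whose `np.matmul` already broadcasts over a leading batch dimension. This is only an analogy under stated assumptions: the helper `batched_gemm` and the list-of-groups layout are hypothetical and do not reproduce the Batched BLAS API described in the article.

```python
import numpy as np

def batched_gemm(groups):
    """Hypothetical helper: multiply many small matrices, one call per group.

    Each group is a pair (A, B) with A of shape (batch, m, k) and B of shape
    (batch, k, n); all matrices within a group share the same sizes.
    """
    # np.matmul treats the leading axis as a batch of independent products.
    return [np.matmul(A, B) for A, B in groups]

rng = np.random.default_rng(0)
groups = [
    # Group 0: four independent 2x3 by 3x2 products.
    (rng.standard_normal((4, 2, 3)), rng.standard_normal((4, 3, 2))),
    # Group 1: two independent 5x5 by 5x5 products.
    (rng.standard_normal((2, 5, 5)), rng.standard_normal((2, 5, 5))),
]
results = batched_gemm(groups)
```

If every matrix had the same size, a single group would suffice, matching the "just one group" case mentioned in the abstract.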